Then the gradient w.r.t. $\beta$ can be similarly calculated as:

$$\frac{\partial \hat{X}^i_B}{\partial \beta} \overset{STE}{\approx} \frac{\partial\, \alpha \cdot \mathrm{Clip}\!\left(\frac{X^i_R - \beta}{\alpha}, 0, 1\right)}{\partial \beta} = \begin{cases} -1, & \text{if } \beta < X^i_R < \alpha + \beta \\ 0, & \text{otherwise} \end{cases} \tag{5.43}$$
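The piecewise gradient in Eq. (5.43) can be verified numerically. The following minimal sketch (not from the paper) assumes the relaxed elastic binarization $\alpha \cdot \mathrm{Clip}((X_R - \beta)/\alpha, 0, 1)$ that the straight-through estimator differentiates, and compares the analytic gradient with a finite-difference approximation:

```python
import numpy as np

# Relaxed elastic binarization that the straight-through estimator differentiates:
# hat_x = alpha * Clip((x - beta) / alpha, 0, 1)
def relaxed_binarize(x, alpha, beta):
    return alpha * np.clip((x - beta) / alpha, 0.0, 1.0)

# Analytic STE gradient w.r.t. beta from Eq. (5.43):
# -1 inside the linear region beta < x < alpha + beta, 0 otherwise.
def grad_beta(x, alpha, beta):
    return np.where((x > beta) & (x < alpha + beta), -1.0, 0.0)

x = np.linspace(-2.0, 2.0, 9)      # sample activations away from the two kinks
alpha, beta, eps = 1.5, 0.3, 1e-4

fd = (relaxed_binarize(x, alpha, beta + eps)
      - relaxed_binarize(x, alpha, beta - eps)) / (2 * eps)
print(np.allclose(fd, grad_beta(x, alpha, beta), atol=1e-3))   # True
```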

For the layers that contain both positive and negative real-valued activations, i.e., $X_R \in \mathbb{R}^n$, the binarized values $\hat{X}_B \in \{-1, 1\}^n$ are indifferent to the scale inside the Sign function: $\hat{X}^i_B = \alpha \cdot \mathrm{Sign}\!\left(\frac{X^i_R - \beta}{\alpha}\right) = \alpha \cdot \mathrm{Sign}(X^i_R - \beta)$. In that case, since the effect of the scaling factor $\alpha$ inside the Sign function can be ignored, the gradient w.r.t. $\alpha$ can be simply calculated as $\frac{\partial \hat{X}^i_B}{\partial \alpha} = \mathrm{Sign}(X^i_R - \beta)$.
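As an illustration of this case, here is a minimal PyTorch-style sketch (an assumption, not the authors' code) of $\hat{X}_B = \alpha \cdot \mathrm{Sign}(X_R - \beta)$, where the backward pass uses $\partial \hat{X}^i_B / \partial \alpha = \mathrm{Sign}(X^i_R - \beta)$; the activation gradient uses a plain clipped straight-through estimator as a placeholder, since it is not derived in this section:

```python
import torch

class ElasticSignBinarizer(torch.autograd.Function):
    """Sketch of X_B = alpha * Sign(X_R - beta) with the alpha gradient from the text."""

    @staticmethod
    def forward(ctx, x, alpha, beta):
        shifted = x - beta
        # Map the rare exact zeros to +1 so outputs stay in {-1, +1}.
        s = torch.where(shifted >= 0, torch.ones_like(shifted), -torch.ones_like(shifted))
        ctx.save_for_backward(shifted, s)
        return alpha * s

    @staticmethod
    def backward(ctx, grad_out):
        shifted, s = ctx.saved_tensors
        # d X_B / d alpha = Sign(X_R - beta); alpha is a scalar, so reduce over all elements.
        grad_alpha = (grad_out * s).sum()
        # Placeholder clipped STE for the activation gradient (assumption, not derived here).
        grad_x = grad_out * (shifted.abs() <= 1).to(grad_out.dtype)
        return grad_x, grad_alpha, None  # no gradient for beta in this sketch
```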

5.10.3 Multi-Distilled Binary BERT

Classical knowledge distillation (KD) [87] trains the outputs (i.e., logits) of a student network to be close to those of a teacher, which is typically larger and more complex. This approach is quite general, and can work with any student-teacher pair which conforms to the same output space. However, knowledge transfer happens faster and more effectively in practice if the intermediate representations are also distilled [1]. This approach has been useful when distilling to student models with similar architecture [206], particularly for quantization [6, 116].
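To make the two kinds of distillation concrete, the following is a minimal sketch of a combined objective; it is a generic formulation rather than the exact loss used for BiT, and the temperature T and the equal weighting of the two terms are assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, T=1.0):
    # Classical KD [87]: match the student's output distribution to the teacher's soft targets.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Representation distillation: also match intermediate hidden states.
    rep = F.mse_loss(student_hidden, teacher_hidden)
    return kd + rep  # equal weighting is an assumption
```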

Note that having a similar student-teacher pair is a requirement for distilling representations. While how similar they need to be is an open question, intuitively, a teacher who is architecturally closer to the student should make transfer of internal representations easier. In the context of quantization, it is easy to see that lower-precision students are progressively less similar to the full-precision teacher, which is one reason why binarization is difficult.

This suggests a multi-step approach, where instead of directly distilling from a full-precision teacher to the desired quantization level, the authors first distilled into a model with sufficient precision to preserve quality. This model can then be used as a teacher to distill into a further quantized student. This process can be repeated multiple times, while at each step ensuring that the teacher and student models are sufficiently similar, and the performance loss is limited.
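The procedure can be written as a short loop over a quantization schedule (introduced next). The sketch below is illustrative only; quantize_model and distill are hypothetical, caller-supplied helpers, not functions from the paper:

```python
def multi_distill(fp_teacher, schedule, train_data, quantize_model, distill):
    """schedule: list of (weight_bits, activation_bits) pairs, from high to low precision.

    quantize_model and distill are caller-supplied callables (hypothetical placeholders).
    """
    teacher = fp_teacher
    for w_bits, a_bits in schedule:
        # Initialize the student from the current teacher and quantize it to the next level.
        student = quantize_model(teacher, w_bits, a_bits)
        # Distill logits and intermediate representations from the (more similar) teacher.
        student = distill(teacher=teacher, student=student, data=train_data)
        # The distilled student becomes the teacher for the next, lower-precision step.
        teacher = student
    return teacher
```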

The multi-step distillation follows a quantization schedule, $Q = \{(b^1_w, b^1_a), (b^2_w, b^2_a), \ldots, (b^k_w, b^k_a)\}$, with $(b^1_w, b^1_a) > (b^2_w, b^2_a) > \cdots > (b^k_w, b^k_a)$¹. $(b^k_w, b^k_a)$ is the target quantization level. In practice, the authors found that, down to a quantization level of W1A2, one can distill models of reasonable accuracy in a single shot. As a result, they followed a fixed quantization schedule, W32A32 → W1A2 → W1A1.
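As a small illustration (not from the paper), the fixed schedule and the precision ordering defined in footnote 1 can be encoded directly:

```python
# Fixed schedule W32A32 -> W1A2 -> W1A1 as (weight_bits, activation_bits) pairs.
SCHEDULE = [(32, 32), (1, 2), (1, 1)]

def strictly_higher(p, q):
    # Footnote 1: (a, b) > (c, d) if a > c and b >= d, or a >= c and b > d.
    (a, b), (c, d) = p, q
    return (a > c and b >= d) or (a >= c and b > d)

# Each step in the schedule strictly reduces precision.
assert all(strictly_higher(SCHEDULE[i], SCHEDULE[i + 1]) for i in range(len(SCHEDULE) - 1))
```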

BiT, which is shown in Fig. 5.16, combines the elastic binary activations with multi-distillation. In doing so, BiT simultaneously ensures good initialization for the eventual student model. Since the binary loss landscape is highly irregular, good initialization is critical to aid optimization.

In summary, the paper's contributions are: (1) the first demonstration of fully binary pre-trained BERT models with limited performance degradation; and (2) a two-set binarization scheme, an elastic binary activation function with learned parameters, and a multi-distillation method to boost the performance of binarized BERT models.

¹(a, b) > (c, d) if a > c and b ≥ d, or a ≥ c and b > d.